| InvocationError error | ||
| Metadata Metadata | ||
| CreatedAt time.Time | ||
| ModelThoughts []*ModelThoughtRecord |
There was a problem hiding this comment.
How much space could ModelThoughts take?
Maybe we should consider making recording configurable, not only for compliance reasons but also to limit space usage.
There was a problem hiding this comment.
It all depends on how many thinking tokens are given to the model, and even then we're only capturing a summary, so I suspect it won't be much. In any case, having this information will help provide insight into why agents take the course of action they do, which may end up being valuable for an audit scenario.
In any case, disk is cheap.
We can later add a flag to make storing these conditional.
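A minimal sketch of what such a flag could look like. The `Options` struct, the `StoreModelThoughts` field, and the `thoughtsToStore` helper are hypothetical names used for illustration only; nothing in this PR defines them.

```go
package main

import "fmt"

// Options is a hypothetical configuration struct; it is not part of this PR.
type Options struct {
	// StoreModelThoughts is an assumed flag controlling whether thought
	// summaries are persisted alongside tool usage.
	StoreModelThoughts bool
}

// thoughtsToStore returns the thoughts unchanged when storage is enabled and
// nil otherwise, so callers can pass the result straight to the recorder.
func thoughtsToStore(opts Options, thoughts []string) []string {
	if !opts.StoreModelThoughts {
		return nil
	}
	return thoughts
}

func main() {
	thoughts := []string{"summary of reasoning"}
	fmt.Println(len(thoughtsToStore(Options{StoreModelThoughts: true}, thoughts)))
	fmt.Println(len(thoughtsToStore(Options{StoreModelThoughts: false}, thoughts)))
}
```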
|
|
||
| // Clear after first use to avoid duplicating across | ||
| // multiple tool calls in the same message. | ||
| thoughtRecords = nil |
There was a problem hiding this comment.
Maybe this is a bit out of scope, but this is a bit confusing to me, and it may be a good time for a small cleanup.
If I understand correctly, clearing thoughtRecords should not matter right now since parallel tool calls are disabled, but it is added in case they are enabled in the future? Or is it just to make sure that the same thoughts are not stored on two tool calls? In the latter case, I think it would be better to have duplicated information on two calls than one call with none.
At the same time, the construction of thoughtRecords is not prepared for the parallel case; it just concatenates all thinking blocks from Content. How about constructing the thoughtRecords slice in this for loop and clearing it on each RecordToolUsage call?
I see that thoughtRecords is later used in the range pendingToolCalls loop, but I think those calls should also be processed in this loop. Then it should be possible to aggregate thinking blocks per call properly (assuming thinking blocks are properly ordered, with tool calls dividing them), which would make the code a bit simpler. Maybe tool call processing could even be extracted to a base struct.
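The per-call aggregation suggested above could be sketched roughly as follows. `Block`, `ToolUsage`, and `groupThoughts` are simplified, hypothetical stand-ins rather than the types used in this repository, and the sketch assumes thinking blocks arrive in order with tool calls dividing them.

```go
package main

import "fmt"

// Block is a hypothetical, simplified stand-in for a response content block.
type Block struct {
	Kind string // "thinking" or "tool_use"
	Text string // thinking summary, or the tool name for a tool_use block
}

// ToolUsage pairs a tool call with the thoughts that immediately preceded it.
type ToolUsage struct {
	Tool     string
	Thoughts []string
}

// groupThoughts walks the content in order, buffering thinking blocks and
// attaching the buffer to the next tool call before resetting it.
func groupThoughts(content []Block) []ToolUsage {
	var usages []ToolUsage
	var pending []string
	for _, b := range content {
		switch b.Kind {
		case "thinking":
			pending = append(pending, b.Text)
		case "tool_use":
			usages = append(usages, ToolUsage{Tool: b.Text, Thoughts: pending})
			pending = nil // reset so the next call only gets its own thoughts
		}
	}
	return usages
}

func main() {
	content := []Block{
		{Kind: "thinking", Text: "need to look at the file first"},
		{Kind: "tool_use", Text: "read_file"},
		{Kind: "thinking", Text: "now apply the change"},
		{Kind: "tool_use", Text: "write_file"},
	}
	for _, u := range groupThoughts(content) {
		fmt.Printf("%s: %d thought(s)\n", u.Tool, len(u.Thoughts))
	}
}
```

If all thinking happens upfront, this grouping degrades to the first tool call carrying every thought and the rest carrying none, which matches the behaviour being questioned here.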
There was a problem hiding this comment.
I think this is related to @pawbana's comment, but just to clarify: IIUC we're storing all thinking records on the first tool call, and subsequent calls get none. Not sure how we're planning to present this in the UI, but all thinking would be associated with a single tool call, which might not be accurate.
Is the purpose of this mapping between tool calls and thinking to "understand why the model chose to use this tool"? If so, I like @pawbana's suggestion of aggregating thinking blocks per call. Otherwise, I'm not sure this mapping of tool calls to thinking is the right approach 🤔
There was a problem hiding this comment.
I've updated the description which should answer both of your questions 👍
There was a problem hiding this comment.
We only track model thoughts in relation to tool use; all other thoughts are irrelevant in the same way that we don't store regular inference responses.
I don't like that some thoughts are irrelevant, yet all thoughts will be recorded with the first tool call regardless of whether they are related to that call.
I see why the trigger to record thoughts is the existence of a tool call in the response, but I don't like connecting thoughts to a tool call, since there is no strong connection between them and some thoughts may be completely irrelevant.
Right now the two supported APIs do not "interleave" thoughts per tool use: all the thinking is done upfront which results in the multiple tool uses.
If thoughts are not interleaved between tool calls, why are they recorded using RecordToolUsage? Because it is the only thing that is recorded per inner agentic loop iteration?
If my assumption about the inner loop is correct, I understand why it is done this way, but recording all thoughts with the first tool call feels like a hack. How about introducing a new method, RecordThoughts? In #203 there is already a separate table for thoughts.
I feel like the following structure better reflects reality:
- interception has:
  - multiple loops, each loop has:
    - one thoughts record
    - can have multiple tool call records (when parallel tool calls are enabled).
For each loop we could match thoughts to tool calls by time (since each loop iteration should be quite slow), or we could add a new field that stores the inner loop iteration in which each record was produced.
With the deprecation of the inner loop, we could simplify this into an interception having one thought record plus multiple tool call records.
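The proposed structure could be sketched like this. Only the names RecordThoughts and RecordToolUsage come from the discussion; their signatures here, the record shapes, the `Iteration` field, and the in-memory `MemoryRecorder` are all assumptions made for illustration.

```go
package main

import "fmt"

// ThoughtRecord and ToolCallRecord are hypothetical shapes. Iteration is the
// assumed new field identifying the inner-loop iteration of each record.
type ThoughtRecord struct {
	InterceptionID string
	Iteration      int
	Content        string
}

type ToolCallRecord struct {
	InterceptionID string
	Iteration      int
	Tool           string
}

// MemoryRecorder is an in-memory stand-in for the real recorder.
type MemoryRecorder struct {
	thoughts  []ThoughtRecord
	toolCalls []ToolCallRecord
}

// RecordThoughts stores one thoughts record for a loop iteration.
func (r *MemoryRecorder) RecordThoughts(t ThoughtRecord) {
	r.thoughts = append(r.thoughts, t)
}

// RecordToolUsage stores one tool call, independently of any thoughts.
func (r *MemoryRecorder) RecordToolUsage(c ToolCallRecord) {
	r.toolCalls = append(r.toolCalls, c)
}

// ToolCallsFor returns the tool calls recorded in the same interception and
// iteration as the given thoughts record.
func (r *MemoryRecorder) ToolCallsFor(t ThoughtRecord) []ToolCallRecord {
	var out []ToolCallRecord
	for _, c := range r.toolCalls {
		if c.InterceptionID == t.InterceptionID && c.Iteration == t.Iteration {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	r := &MemoryRecorder{}
	r.RecordThoughts(ThoughtRecord{InterceptionID: "i1", Iteration: 0, Content: "inspect two files"})
	r.RecordToolUsage(ToolCallRecord{InterceptionID: "i1", Iteration: 0, Tool: "read_file"})
	r.RecordToolUsage(ToolCallRecord{InterceptionID: "i1", Iteration: 0, Tool: "list_dir"})
	r.RecordToolUsage(ToolCallRecord{InterceptionID: "i1", Iteration: 1, Tool: "write_file"})
	for _, t := range r.thoughts {
		fmt.Printf("iteration %d: %d tool call(s)\n", t.Iteration, len(r.ToolCallsFor(t)))
	}
}
```

Matching on an explicit iteration field avoids relying on timestamps, which is the alternative mentioned above.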
|
Maybe I'm missing something but is there a reason thinking blocks are merged into |
intercept/messages/streaming.go
Outdated
| case anthropic.ThinkingBlock: | ||
| thoughtRecords = append(thoughtRecords, &recorder.ModelThoughtRecord{ | ||
| Content: variant.Thinking, | ||
| CreatedAt: time.Now(), |
There was a problem hiding this comment.
AFAIK, with streaming, the code waits until we get a stop block and processes all thinking blocks at that point, meaning they'll all have the same CreatedAt. I assume the ordering is still preserved by their position in the slice, so this probably doesn't matter, but worth noting that CreatedAt won't reflect when each block actually arrived. Could this be an issue?
There was a problem hiding this comment.
This is also true of tool usages, which are recorded at this stage as well.
Postgres is microsecond-precise, so some records may be persisted with the same timestamp.
The timestamp for thoughts doesn't really matter, though; as long as it's associated with a tool call it'll be displayed correctly (i.e. before the tool call).
Signed-off-by: Danny Kopping <danny@coder.com>

Required for coder/coder#22676
Closes #168
This PR adds recording of model "thoughts" (sometimes called "reasoning"). This is only available for /v1/messages and /v1/responses. We only track model thoughts in relation to tool use; all other thoughts are irrelevant in the same way that we don't store regular inference responses.
I've raised a PR downstack to allow parallel tool calls when there are no injected tools.
This leaves open the question of how we will associate model thoughts to these tools. Right now the two supported APIs do not "interleave" thoughts per tool use: all the thinking is done upfront which results in the multiple tool uses. Here are two examples:
/v1/messages
/v1/responses